Computational Intelligence for Observation and Monitoring: A Case Study of Imbalanced Hyperspectral Image Data Classification

Class imbalance in hyperspectral images complicates their analysis and classification. Resampling techniques are used to reduce this imbalance, but previous research has explored only a limited number of resampling methods, and comparatively little work has been done. In this study, we present a novel illustrative study of the performance of existing resampling techniques, viz. oversampling, undersampling, and hybrid sampling, for removing the imbalance affecting the minor samples of hyperspectral datasets. The balanced dataset is then classified with tree-based ensemble classifiers that use both spectral and spatial features. Finally, a comparative study is performed based on a statistical analysis of the outcomes obtained from those classifiers, as discussed in the results section. In addition, we applied a new ensemble hybrid classifier named random rotation forest to our datasets. Three benchmark hyperspectral datasets, Indian Pines, Salinas Valley, and Pavia University, are used for the experiments. We take precision, recall, F score, Cohen kappa, and overall accuracy as assessment metrics to evaluate our model. The results show that SMOTE, Tomek Links, and their combinations stand out as the more effective resampling strategies. Moreover, ensemble classifiers such as rotation forest and random rotation ensemble provide higher accuracy than others of their kind.


Introduction
In recent times, images have been one of the prime data sources. Hyperspectral images (HSIs) are currently popular because of the enormous amount of information they capture about an earth surface scene. HS data can be used in various ways to advance human technology [1]. HSI refers to spectral imaging data acquired by satellite-borne or airborne spectrometers. The images are taken over specific earth surfaces, referred to as the scene, containing various land cover classes such as flora, concrete, and water bodies. Because each land cover occupies a different surface area, the number of pixels representing each class varies. However, HS data pose various difficulties, including noise, the quality and quantity of labeled data, dimensionality, and categorical sample imbalance [2]. Additionally, analyzing and interpreting HS data requires several processes, including denoising, reducing the high dimensionality, spectral unmixing, and, most critically, identifying land cover [3]. The classification of the imaging scene has preoccupied professionals since the inception of hyperspectral data. Initially, they used statistics-based classifiers in conjunction with some preprocessing techniques. The categorization problem became easier to handle with breakthroughs in machine learning (ML) and the introduction of deep learning (DL), which provide an excellent strategy for dealing with the dataset's embedded concerns [4].
Imbalanced data refer to classification challenges in which the classes are not equally represented; the major class is the most common, while the minor class is the rarest [5]. Information security, medical imagery, bioinformatics, network intrusion, and fraud detection are just a few examples of real-world domains whose datasets suffer from imbalanced classification [6]. The HSI dataset is also skewed, since some class labels have too few data instances because of their different land area coverage. The sample distribution per class can range from a slight imbalance to a severe imbalance, with few samples in the minor class and hundreds in the major class. Two basic criteria can be used to characterize the HSI imbalance: (1) the minor class's shortage of knowledge and (2) the imbalance ratio (ImbR), i.e., the ratio between the minor and major classes [7]. The mathematical formula is given as follows:
$$\mathrm{ImbR} = \frac{\text{Number of samples belonging to the minor class}}{\text{Number of samples belonging to the major class}}. \tag{1}$$
Due to its complicated data structure, HSI inevitably confronts an imbalanced classification challenge. Imbalanced classification complicates predictive modeling because most machine learning algorithms for classification were designed with an equal number of samples per class in mind. As a result, models with lower prediction accuracy arise, particularly for the minor class, which is a problem because the minor class is frequently more significant than the major class. Consequently, categorization errors are more likely to occur in the minor class than in the major class [8]. Because of the dataset's inherent complexity, learning from it demands new views, approaches, concepts, and methods for transforming data. The most effective way to address the data imbalance is to resample the data instances to roughly equal proportions. The three types of sampling procedures employed are oversampling, undersampling, and hybrid sampling.
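As a minimal illustration of equation (1), the sketch below computes ImbR from a list of class labels. The function name `imbalance_ratio` and the toy label counts are assumptions made for this example; they are not taken from the datasets used in the study.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return ImbR: size of the smallest class divided by size of the largest class."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())


# Hypothetical scene: 5 "water" pixels (minor class) and 50 "flora" pixels (major class).
labels = ["water"] * 5 + ["flora"] * 50
ratio = imbalance_ratio(labels)
print(f"ImbR = {ratio:.2f}")
```

A value close to 1 indicates a nearly balanced pair of classes, while a value close to 0 indicates severe imbalance.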
Oversampling entails duplicating randomly chosen samples from the minor class, which can lead to overfitting. In contrast, undersampling entails removing random samples from the major class, resulting in information loss. There are also hybrid balancing strategies that combine an oversampler and an undersampler to balance the samples of each class in the same dataset [9]. After correcting the imbalance, a suitable model must be trained on the dataset. Logistic regression (LR), naive Bayes (NB), support vector machine (SVM), and tree-based classifiers are examples of benchmark machine learning algorithms suitable for moderately balanced datasets [10]. Ensemble learning methods have become increasingly prominent in recent years. The primary objective of those systems is to increase performance by aggregating the findings of multiple weak classifiers. These systems employ a voting technique among all the weak classifiers to obtain the final classification result [11]. A decision tree (DT) [12] is the most basic building block of bagging: a DT is created individually for each subset of the original dataset, and a voting mechanism then determines the final result among those DTs. Random forest (RF) [13] is the most widely used tree-based ensemble classifier based on the bagging approach for both classification and regression. The insensitivity of RF to spectral bands and its ability to handle missing and imbalanced data are two of its most attractive characteristics. It can also be used on noisy samples because it does not overfit the data easily [14]. The extra trees (ET), or extremely randomized trees [15], approach works by producing many unpruned decision trees from the training dataset.
For regression, predictions are formed by averaging the predictions of the constituent DTs, whereas for classification, majority voting is applied. Unlike RF and bagging, which build each DT using a bootstrap sample of the training dataset, ET fits each DT to the whole training dataset. Rotation forest (RoF) [16] algorithms outperform bagging on noise-free and imbalanced data. Compared with bagging and RF, RoF can achieve similar or better results with fewer trees. Blaser and Fryzlewicz [17] proposed an ensemble of random and rotation forests named the random rotation ensemble forest (RREF). The random rotation efficiently creates a new coordinate system for each base learner, increasing ensemble diversity without sacrificing accuracy. Moreover, one significant distinction between random rotation and random projection is that rotations are reversible, meaning that there is no information loss.
The premise that motivated our research is a broad analytical study of the prevailing resampling techniques, viz. oversampling, undersampling, and hybrid sampling, and their impact on hyperspectral images. For oversampling, four useful techniques are selected, namely random oversampling (ROS) [18], synthetic minority oversampling technique (SMOTE) [19], borderline SMOTE (B-SMOTE) [20], and adaptive synthetic minority oversampling technique (ADASYN) [21]. Furthermore, four popular undersampling methods are studied, namely random undersampling (RUS) [22], Tomek Links (TLs) [23], neighborhood cleaning rule (NCL) [24], and edited nearest neighbor (ENN) [25]. Finally, two hybrid sampling techniques are considered: SMOTETomek [26], a combination of SMOTE and TL, and SMOTEENN [27], a combination of SMOTE and ENN. These strategies are used to balance the dataset on which the model is trained.
These balanced datasets are then passed to the classifiers as training inputs for categorizing the land covers in the HS scenes. In this work, we have used well-known tree-based ensemble classifiers that have demonstrated compatibility with synthetically balanced, high-dimensional HSI datasets. The ensemble classifiers employed here are DT, ET, RF, RoF, and RREF. They are used to construct the entire comparison study, in which each classifier is assessed with all the resampling techniques for each dataset. The quality and performance of the model are evaluated on testing data using various metrics such as precision, recall, F score, overall accuracy, and Cohen kappa score.
Our work provides the following contributions:
(a) A comprehensive adaptation of conventional resampling algorithms to correct the hyperspectral datasets' substantial imbalance.
(b) A novel approach to classifying hyperspectral images using efficient tree-based ensemble learning methods, i.e., the traditional tree-based techniques and the hybridized and modified forest methods, to structure the classification model.
(c) An advanced comparative investigation of various oversampling, undersampling, and hybrid sampling strategies applied to hyperspectral datasets, showing that they help rectify the major-minor sample imbalance.
(d) A thorough comparison of the included tree-based ensemble classifiers in categorizing the surface covers captured in the HSI, in terms of different performance assessment metrics. In addition, these classifiers are capable of learning joint spectral-spatial features of the balanced datasets.
(e) A comparative case study with all the resampling techniques used for HSI datasets that can be carried forward to other computational intelligence and monitoring applications with several types of datasets, especially in the area of medical imagery.
(f) Greater insight for researchers who deal with big data and voluminous imagery data that suffer from imbalanced samples. The comparison presented here helps in choosing the more appropriate strategy for different datasets and gives a better view for improving those strategies.
The remainder of the paper is organized as follows: Section 2 describes previous work in the area of resampling for various imbalanced datasets, Section 3 presents the methodology of our research, Section 4 illustrates the model evaluation and test results, and Section 5 provides the conclusion and discusses the limitations and future scope of the work.

Previous Works
Imbalanced data have produced numerous problems in the classification of hyperspectral images. Researchers have employed preprocessing techniques to deal with the issue of class imbalance in several application fields. Preprocessing approaches for the imbalanced class problem include data-driven methodologies, such as sampling. Oversampling, undersampling, and hybrid sampling are the three types of sampling. The suitable sampling technique is determined by the dataset, the sample size of each class, and their imbalance ratio (ImbR). Figure 1 depicts the overall changes in the major and minor class samples in a dataset under the three sampling procedures. In the original dataset, one class, the major class, dominates another class, considered minor. In oversampling, the minor class is populated to match the number of major class samples by creating new synthetic instances in the neighborhood of existing samples. Undersampling techniques, on the other hand, remove linked and redundant major class samples to bring balance. Hybrid sampling incorporates both strategies to eliminate the imbalance in the dataset.
Oversampling, also known as upsampling, is a sampling technique that helps to balance a dataset by duplicating minor class examples. This procedure has the advantage of causing little or no data loss, but it can cause overfitting and adds to the computational load. The two types of oversampling are ROS and informative oversampling. ROS balances the class distribution by randomly repeating minor class examples.
The informative oversampling technique [28] synthesizes minor class samples according to a predefined criterion. Several applications of oversampling have been deployed for various types of datasets in recent years. A linear SVM was used in conjunction with a few SMOTE variations to detect malware in [29]. The model synthesizes malicious instances based on signatures, from the standpoint of the nominal properties. The malicious traffic dataset is first clustered using a single-linkage hierarchical technique to enrich the malicious class dataset, and then the signatures produced by each malicious traffic cluster are used. The resulting balanced datasets are then used to train a semantic malware detection model for mobile devices. Random forest was used as the classifier with SMOTE to overcome a massive data imbalance problem: with a constrained hyperparameter set and a nondynamic oversampling rate, SMOTE is used to eliminate the imbalance after binarizing the original dataset [30].
This approach fails in multiclass scenarios. In [31, 32], the same combination was used to detect fraudulent insurance claims and to predict depression in women due to the modern lifestyle. B-SMOTE and an SVM with a sigmoid kernel were employed for data augmentation on P300 users with poor BCI performance [33]. For the DEAP dataset, a 1D-CNN model was used to classify two emotional dimensions, valence and arousal, and B-SMOTE was employed to acquire a more homogeneous set of EEG signal features [34]. For classifying HSIs, rotation forest has been combined with dynamic SMOTE [35], where SMOTE is applied to the imbalanced classes before each rotation tree is constructed. The procedure was found to be time-consuming. This work is extended in [36], where the SMOTE technique is employed to create balanced datasets by incorporating spatial information from the pixels surrounding each sample. These datasets are fed into the weighted rotation forest model, which combines RoF and a multilevel cascaded RF. The cascade forest receives the rotation feature vectors generated by the rotation forest. In addition, the output likelihood of every level is stacked with the original data. Furthermore, the sample weights need to be adjusted regularly using the dynamic weight function derived from the classification scores at each level. According to [37], adaptive synthetic sampling (ADASYN) is another excellent oversampling strategy when combined with a convolutional neural network to detect intrusion in a wireless network. ADASYN prevents the model from being sensitive to large samples but insensitive to small samples, which helps in small sample recognition and learning. The model considerably improves multiclass classification. However, a simpler residual network is still needed to increase small sample identification accuracy and execution performance.
Another attempt at intrusion detection is made in [38], where ADASYN is applied to oversample the training dataset to increase the number of infiltration and Heartbleed attack samples. Classification and regression trees (CARTs) with the Gini coefficient were used to create the DTs for the RF approach. Even though this method delivers better prediction performance, efficacy, and resilience, the parameters of ADASYN were tuned manually. All the above oversampling-based works suffer from voluminous replicated data that are sometimes redundant, which leads to the common issue of high storage, computation, and time complexity.
Undersampling, also known as downsampling, is another data balancing technique. In this method, the classifier is trained on a subset of the major class, because some samples are deleted from the major class. Random undersampling and informative undersampling are the two types of undersampling algorithms. The principle of random undersampling is simple: samples from the major class are randomly removed until the dataset is balanced. The informative undersampling technique selects only the necessary major class instances based on a prespecified selection criterion to balance the dataset [28]. Random undersampling [39] is most relevant in big data settings since it helps the random forest make more accurate classifications in less time. In [40], the same methodology is used for large specialized bioinformatics data, where feature selection (FS) is used in conjunction with RUS, and the relationship between the two is investigated. The random forest learner is used in the FS component to compute feature importance, and encoding is used to transform categorical features into dummy variables. RUS has a faster runtime and a lower computing burden than random oversampling; however, the classification technique needs to be appropriate.
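The principle of random undersampling stated above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `random_undersample`, the choice to shrink every class to the size of the smallest one, and the toy data are hypothetical and are not the configuration used in the cited works.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly drop samples so that every class keeps as many as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    target = min(len(indices) for indices in by_class.values())
    keep = []
    for indices in by_class.values():
        # Sampling without replacement: no retained sample is duplicated.
        keep.extend(rng.sample(indices, target))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical data: 9 major class samples (label 0) and 3 minor class samples (label 1).
X = [[float(i)] for i in range(12)]
y = [0] * 9 + [1] * 3
X_bal, y_bal = random_undersample(X, y)
print(Counter(y_bal))
```

The six discarded major class samples are simply lost, which is the information loss the text warns about.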
Tomek links have been demonstrated in [41], where TLs are used to eliminate outliers and noisy and redundant samples from the major class of 10 real-world datasets. The removal of potentially ineffective examples causes the decision boundary to shift towards the minority region, providing a favorable environment for learning with various classifiers, including SVM. To create an application-oriented multiclass real-life application, the model must be supplemented with multiple schemes/techniques to eliminate the majority instances with minimal data loss and faster processing. For overpopulated bacterial data, the TL algorithm is used in the preparation phase to clean the data and reduce noise, producing a better result than oversamplers [42]. In addition, Tomek links are used to correct the imbalance in some medical datasets [43], where balanced data are fed into a stacking ensemble after downsampling. It works on two levels. At level 0, there are many different classifiers, such as NB and SVM. The level 0 output is given to the level 1 classifier for the final prediction. Base classifiers such as LR, k-NN, and NB are applied to the datasets, but the results are not more accurate or specific than contemporary research. The works discussed above used limited datasets and basic classifiers. On the one hand, undersampling can reduce the computational burden; on the other hand, it may remove significant information that could have produced a better outcome.

Computational Intelligence and Neuroscience
Hybrid sampling is an appropriate combination of oversampling and undersampling procedures that correctly balances the training data. The oversampling strategy is used to create fresh samples by randomly sampling the current training data with replacement. Because the minor class is oversampled, a new balanced training dataset is created. The undersampling method is then used to reduce unwanted overlap between classes, lowering the number of samples. As a result, majority data are eliminated until a firmer threshold for the classifier's decisions can be established [26]. Although the studies are few, there is significant work in this area. Intrusion detection is carried out in [44] using a mix of SMOTE and the neighborhood cleaning rule (NCL). So that the learning process is unaffected by the data distribution, SMOTE generates a small amount of synthetic data. Furthermore, the explored dataset revealed that borderline and noisy data have an impact on classification performance. As a result, NCL removes noisy and boundary data from the oversampled data. C4.5 and SVM are used as classifiers, but the model is not robust. The same hybrid resampling technique is used in conjunction with logistic regression [45]. This model proves most effective for binary datasets. Another hybrid strategy combines synthetic oversampling with Tomek links and has been applied to credit card fraud detection [47] and to medical disease datasets [46]. Overall, hybrid techniques are more prone to data loss and consume additional time.
Both works are based on a comparison of results from various classifiers. Student depression data collected across universities are handled with a combination of random oversampling, Tomek links, and random forest. This model is adequate only for binary and less noisy datasets [48]. A hybrid of SMOTE and ENN is used to process the KDDCup99 dataset and to tackle the difficulties of data imbalance and sample overlap with an RF classifier [49]. The same combination, but with the XGBoost classifier, is used to build a prediction model that efficiently determines whether a person is healthy or has Parkinson's disease [50]. RRE pruning is used for HSI classification: constituent classifiers with poor complementarity are pruned, and the remaining constituent classifiers with higher complementarity are combined to produce an ensemble classifier. These strategies ensure that the component classifiers used to build the ensemble are accurate yet diverse, which enhances the ensemble classifier's performance [51].
Figure 2 displays the framework of our suggested approach for improving hyperspectral image classification by coping with sample imbalance. The steps involved in structuring our study are as follows.

Data Preprocessing
3.1.1. Dataset. For our experiment, we collected three widely explored hyperspectral datasets that are available in the public domain [52] and stored them in the memory of our system. A brief description of the datasets follows, along with Figure 3.

Loading the Dataset and Splitting. The datasets are imported as hypercubes and converted into a processable three-dimensional format. Then, the 3D images are reshaped into a machine-readable 2D format. The dataset is further split into training and testing datasets in a ratio of 3 : 2; i.e., we use 60% of each original dataset for training our model, and the remaining 40% is set aside for testing the model's performance. The training set is processed from the next step onwards, while the testing dataset remains intact. The training and testing samples for each dataset are listed in Table 1.
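The reshaping and the 3 : 2 split described above can be sketched as follows. The helper names, the plain (non-stratified) shuffle, and the tiny 2 x 5 scene with 3 bands are assumptions made for illustration; the text does not state how the split was drawn.

```python
import random


def flatten_hypercube(cube):
    """Turn a rows x cols x bands nested list into a list of per-pixel spectra (2D)."""
    return [list(pixel) for row in cube for pixel in row]


def split_train_test(samples, labels, train_fraction=0.6, seed=0):
    """Shuffle the sample indices and split them into training and testing parts."""
    rng = random.Random(seed)
    order = list(range(len(samples)))
    rng.shuffle(order)
    cut = round(train_fraction * len(order))
    train_idx, test_idx = order[:cut], order[cut:]
    X_train = [samples[i] for i in train_idx]
    X_test = [samples[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Hypothetical scene: 2 rows x 5 columns, 3 spectral bands per pixel.
cube = [[[r, c, r + c] for c in range(5)] for r in range(2)]
ground_truth = [[0, 0, 1, 1, 1], [0, 1, 1, 1, 1]]

pixels = flatten_hypercube(cube)
labels = [label for row in ground_truth for label in row]
X_train, X_test, y_train, y_test = split_train_test(pixels, labels)
print(len(X_train), len(X_test))
```

Only `X_train` and `y_train` would be passed to the resampling step; the test part stays untouched, as the text requires.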

Oversampling Techniques
(1) Random Oversampling (ROS). ROS [18] involves selecting random examples from the minority class with replacement and augmenting the training data with copies of those instances, so that a specific instance may be chosen many times. Overfitting has been shown to be more likely when ROS is applied.
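A minimal sketch of ROS follows, assuming that every class is grown to the size of the largest class. The function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples (with replacement) until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    target = max(len(indices) for indices in by_class.values())
    X_out, y_out = list(X), list(y)
    for label, indices in by_class.items():
        for _ in range(target - len(indices)):
            i = rng.choice(indices)  # the same instance may be picked many times
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out


# Hypothetical data: 9 major class samples (label 0) and 3 minor class samples (label 1).
X = [[float(i)] for i in range(12)]
y = [0] * 9 + [1] * 3
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Every added sample is an exact copy of an existing one, which is why a classifier trained on the result can overfit those repeated points.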
(2) Synthetic Minority Oversampling Technique (SMOTE). Chawla et al. [19] presented SMOTE as an oversampling strategy to avoid the overfitting problem. This method is considered state of the art and is effective in a wide range of applications, including HSIs. The approach generates synthetic data based on feature space similarities between existing minority instances. To create an artificial instance, it determines each minority instance's k-NN, chooses one of them at random, and then uses linear interpolation to create a new minor instance in the neighborhood. The detailed algorithm of SMOTE is as follows:
Step 1: for each minority instance $x_i$, the k-nearest neighbors among the minor class samples are calculated using the Euclidean distance.
Step 2: a neighbor $x_j$ is picked at random from the k-nearest neighbors of $x_i$.
Step 3: a new sample $x_{\mathrm{new}}$ is produced between $x_j$ and $x_i$:
$$x_{\mathrm{new}} = x_i + \beta \, (x_j - x_i),$$
where $\beta$ is a random number between 0 and 1.
(3) Borderline SMOTE (B-SMOTE). B-SMOTE [20] creates synthetic samples along the boundary dividing the minor and major groups, which also aids in separating the two groups. The minor class observations are first classified by this approach. If all of a minor observation's neighbors are in the major class, the observation is identified as noise and ignored when synthesizing data. Resampling is then performed entirely from the border points whose neighborhoods include both major and minor classes.
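The three SMOTE steps given earlier in this subsection can be sketched as follows. This is a hedged illustration rather than the implementation used in the study: the function name `smote`, the parameter `n_new`, and the random choice of the seed instance are assumptions, and the interpolation follows the usual form x_new = x_i + beta (x_j - x_i).

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating towards nearest neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x_i = rng.choice(minority)
        # Step 1: k nearest minority neighbours of x_i (Euclidean distance), excluding x_i.
        neighbours = sorted(
            (p for p in minority if p is not x_i),
            key=lambda p: math.dist(x_i, p),
        )[:k]
        # Step 2: pick one neighbour x_j at random.
        x_j = rng.choice(neighbours)
        # Step 3: interpolate with a random beta in [0, 1).
        beta = rng.random()
        synthetic.append([a + beta * (b - a) for a, b in zip(x_i, x_j)])
    return synthetic


# Hypothetical minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=10, k=2, seed=3)
print(len(new_points))
```

Because each synthetic point lies on a segment between two minority samples, it stays inside the region already occupied by the minor class instead of repeating an existing point.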
(4) Adaptive Synthetic Minority Oversampling Technique (ADASYN). He et al. [21], inspired by SMOTE, presented the ADASYN technique, which has received considerable attention. ADASYN generates minor class samples based on their density distributions. More synthetic data are produced for minor class samples that are hard to learn than for those that are easier to learn. It computes each minor instance's k-NN and then uses the ratio of minor and major examples among them to produce new samples. By repeating this process, it adaptively shifts the decision boundary to concentrate on the samples that are hard to learn. ADASYN improves learning of the data distribution in two ways: (1) reducing the bias created by the class imbalance and (2) adjusting the classification decision boundary in the direction of the difficult examples. The training dataset $D_{TR}$ is assumed to contain $n$ samples $\{x_j, y_j\}$, $j = 1, \ldots, n$, where $x_j$ is an example in the n-dimensional feature space $X$ and $y_j \in Y = \{1, -1\}$ is the class label associated with $x_j$; $n_s$ and $n_l$ denote the number of minority and major class examples, respectively. Thus, $n_s \le n_l$ and $n_s + n_l = n$. Based on this notation, the steps are as follows:
Step 1: compute the degree of class imbalance: $i = n_s / n_l$, with $i \in (0, 1]$.
Step 2: compute the number of synthetic data samples that need to be produced for the minor class:
$$R = (n_l - n_s) \times \beta,$$
where $\beta \in [0, 1]$ is the parameter specifying the required balance level after the synthetic data creation.
Step 3: for every sample $x_j$ in the minor class, find its k-nearest neighbors based on the Euclidean distance in the n-dimensional space, and compute the ratio $P_j$:
$$P_j = \frac{\Delta_j}{k}, \quad j = 1, \ldots, n_s,$$
where $\Delta_j$ denotes the number of samples among the k-nearest neighbors of $x_j$ that belong to the major class; thus, $P_j \in [0, 1]$.
Step 4: normalize $P_j$ according to $\hat{P}_j = P_j / \sum_{j=1}^{n_s} P_j$ so that $\hat{P}_j$ is a density distribution ($\sum_j \hat{P}_j = 1$).
Step 5: calculate the number of synthetic data samples that need to be produced for every minority sample $x_j$:
$$r_j = \hat{P}_j \times R,$$
where $R$ is the total number of synthetic data instances that need to be produced for the minor class.
Step 6: for every minor class data instance $x_j$, produce $r_j$ synthetic data samples, each by choosing one minor data example $x_{zj}$ at random from the k-nearest neighbors of $x_j$:
$$x_{\mathrm{new}} = x_j + (x_{zj} - x_j) \times \lambda,$$
where $(x_{zj} - x_j)$ is the difference vector in the n-dimensional space and $\lambda$ is a random number, $\lambda \in [0, 1]$.

Undersampling Techniques
(1) Random Undersampling (RUS). RUS [22] chooses and eliminates samples from the major class at random, reducing the number of major class examples in the modified data. However, RUS has the significant disadvantage of discarding useful information.
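The six ADASYN steps listed before the RUS description can be sketched as follows. This is an illustration under stated assumptions: the function name `adasyn` and the toy points are hypothetical, the per-sample counts are rounded to integers (so their sum may differ slightly from the requested total), and the formulas follow the commonly cited ADASYN formulation.

```python
import math
import random


def adasyn(minority, majority, beta=1.0, k=5, seed=0):
    """Return (synthetic samples, per-sample counts) following the six ADASYN steps."""
    rng = random.Random(seed)
    n_s, n_l = len(minority), len(majority)
    degree = n_s / n_l                      # Step 1: degree of class imbalance
    total = (n_l - n_s) * beta              # Step 2: number of samples to create
    everyone = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    # Step 3: share of major class points among the k nearest neighbours.
    ratios = []
    for x in minority:
        nearest = sorted(
            (e for e in everyone if e[0] is not x),
            key=lambda e: math.dist(x, e[0]),
        )[:k]
        ratios.append(sum(flag for _, flag in nearest) / k)
    norm = sum(ratios)
    if norm == 0:                           # no minority sample borders the major class
        return [], [0] * n_s
    # Steps 4 and 5: normalise the ratios and allocate (rounded) counts per sample.
    counts = [round(p / norm * total) for p in ratios]
    # Step 6: interpolate towards randomly chosen minority neighbours.
    synthetic = []
    for x, count in zip(minority, counts):
        neighbours = sorted(
            (m for m in minority if m is not x),
            key=lambda m: math.dist(x, m),
        )[:k]
        for _ in range(count):
            x_z = rng.choice(neighbours)
            lam = rng.random()
            synthetic.append([a + lam * (b - a) for a, b in zip(x, x_z)])
    return synthetic, counts


# Hypothetical data: two "easy" minority points far from the major class and one
# "hard" minority point surrounded by major class points.
minority = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
majority = [[10.0, 11.0], [11.0, 10.0], [11.0, 11.0], [9.0, 10.0], [10.0, 9.0]]
synthetic, counts = adasyn(minority, majority, beta=1.0, k=1)
print(counts)
```

In this toy case all synthetic samples are allocated to the minority point that sits among major class points, which is the adaptive behaviour that distinguishes ADASYN from plain SMOTE.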
(2) Tomek Links (TLs). TL is a variant of Tomek's condensed nearest neighbor (CNN) undersampling algorithm [23]. Unlike the CNN technique, which selects samples with their k-NNs from the major class that has to be deleted at random, the TL method employs a rule to select pairs of observations (suppose A and B) that meet the following criteria: (1) the observation B is A's closest neighbor; (2) the observation A is B's closest neighbor; and (3) observations A and B are from distinct classes; i.e., A and B are members of the minor and major classes, respectively, or vice versa.
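The three criteria above translate directly into a search for mutual nearest neighbours with different labels. The sketch below is a brute-force illustration; the function name `tomek_links` and the one-dimensional toy data are assumptions.

```python
import math


def tomek_links(X, y):
    """Return index pairs (a, b), a < b, that are mutual nearest neighbours with different labels."""
    def nearest(i):
        return min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )

    nn = [nearest(i) for i in range(len(X))]
    return [
        (a, b)
        for a, b in enumerate(nn)
        if a < b and nn[b] == a and y[a] != y[b]
    ]


# Hypothetical 1D data: samples 2 and 3 sit next to each other but carry different labels.
X = [[0.0], [0.1], [1.0], [1.05], [3.0]]
y = [0, 0, 0, 1, 1]
links = tomek_links(X, y)
print(links)
```

When TL is used for undersampling, the major class member of each returned pair is the one that would be removed.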
(3) Neighborhood Cleaning Rule (NCL). NCL [24] is an undersampling strategy that reduces data through cleaning to overcome the imbalanced class distribution. One of the benefits of NCL is that it examines the quality of the data to be removed rather than focusing solely on data reduction. The data cleaning procedure is intended for samples from both major and minor classes. Essentially, NCL builds on the idea of one-sided selection (OSS), an instance-based data reduction technique that reduces the classes carefully. In NCL, the data cleaning process is conducted separately for the major and minor samples.

(4) Edited Nearest Neighbor (ENN). The ENN approach, developed by Wilson [25], works by first determining the k-NN of every observation and then checking whether the majority class among the observation's k-NN is the same as the observation's class. If they differ, the observation and its k-NN are removed from the dataset. This method is more aggressive than TL because, when the observation's class and the majority class of its k-NN differ, ENN removes the observation and its k-NN rather than only the observation and its 1-nearest neighbor. As a result, ENN is likely to provide more thorough data cleaning than TL.
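A minimal sketch of the ENN check follows. Note the assumption: the sketch uses the commonly implemented form of Wilson's rule, in which only the disagreeing observation is dropped, which is narrower than the neighbour-removing behaviour described above. The function name and the toy data are hypothetical.

```python
import math
from collections import Counter


def edited_nearest_neighbours(X, y, k=3):
    """Keep only the samples whose label agrees with the majority label of their k-NN."""
    keep = []
    for i in range(len(X)):
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )[:k]
        majority_label = Counter(y[j] for j in neighbours).most_common(1)[0][0]
        if majority_label == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical 1D data: the sample at 0.15 carries label 1 but sits among label 0 samples.
X = [[0.0], [0.1], [0.2], [0.15], [5.0], [5.1], [5.2], [5.3]]
y = [0, 0, 0, 1, 1, 1, 1, 1]
X_clean, y_clean = edited_nearest_neighbours(X, y, k=3)
print(len(X_clean))
```

With an odd k and two classes the neighbourhood vote cannot tie; for multiclass data a tie-breaking rule would have to be chosen.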

Hybrid Sampling Techniques
(1) SMOTETomek. This method, first proposed by Batista et al. [26], blends SMOTE's ability to create synthetic data for the minor class with TL's ability to remove the major class data identified as Tomek links, i.e., the major class samples that are nearest to the minor class data.
(2) SMOTEENN. This method, established by Batista et al. [27], merges the ability of SMOTE to generate synthetic examples for the minor class with the ability of ENN to delete observations from both classes, namely those whose class differs from the majority class of their k-NN.

Tree-Based Ensemble Classifiers
(1) Decision Tree (DT). The DT algorithm is one of the most widely used supervised data mining approaches [12]. DT uses a divide-and-conquer approach. It finds the feature with the best ability to classify and splits the data into subsets recursively until a stopping criterion is fulfilled. The class is predicted using decision rules derived from the input data. The root represents the best feature, as determined by attribute selection measures such as information gain or the Gini index. It can work with both numerical and categorical data. Furthermore, outliers and missing values have a negligible impact on the model's results. However, DT uses a greedy technique, which might lead to overfitting [53].
The DT algorithm can be applied as a feature selection strategy in addition to a classification method [54]. The features used to construct splitting rules at internal tree nodes are the result of DT feature selection. DT is a filter strategy because it measures features rather than classification accuracy.
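To make the attribute selection step of the decision tree concrete, the sketch below scores candidate thresholds on one feature with the Gini index and keeps the best one. The function names and the toy values are assumptions; a full DT would repeat this search over all features and recurse on both subsets.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 minus the sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted Gini) of the best split of the form value <= threshold."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical single feature with a clean class boundary between 2 and 3.
values = [1, 2, 3, 4]
labels = [0, 0, 1, 1]
print(gini(labels))
print(best_threshold(values, labels))
```

The greedy nature of the method is visible here: each node picks the locally best split without looking ahead.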
(2) Random Forest (RF). RF is a well-known ensemble ML approach that stems from DT [13]. While building a model, it can manage the overfitting tendency of DT. Many classification models are created, each constructed using a feature selector such as information gain, the Gini index, or the gain ratio. Each of these models contributes separately to the prediction [53]. Random sample selection and random feature selection are the essential concepts. All trees in RF are independent of one another, allowing for parallel training and testing. Consider the dataset $S_n$, which contains $n$ samples $(U, V)$, with $U \in \mathbb{R}^S$. First, $m$ instances are randomly chosen with replacement from the original dataset $S_n$, and the current decision tree is built using these examples. Second, from the initial $S$ features, $p$ features ($p < S$) are picked at random. CARTs are produced using the Gini impurity or mean-squared error criterion. Finally, the classification result is produced by majority vote [36].
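The two sources of randomness in RF and the final vote can be sketched independently of any tree implementation. The helper names below are hypothetical; the tree-building step itself is omitted.

```python
import random
from collections import Counter


def bootstrap_indices(n, m, rng):
    """Draw m sample indices from range(n) with replacement."""
    return [rng.randrange(n) for _ in range(m)]


def random_feature_subset(n_features, p, rng):
    """Pick p distinct feature indices out of n_features (p < n_features)."""
    return sorted(rng.sample(range(n_features), p))


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
sample_ids = bootstrap_indices(n=10, m=10, rng=rng)     # rows used by one tree
feature_ids = random_feature_subset(n_features=8, p=3, rng=rng)  # features tried at a split
print(sample_ids, feature_ids)
print(majority_vote(["flora", "water", "flora"]))
```

Because each tree only depends on its own draws, the trees can be trained and evaluated in parallel, as noted above.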
(3) Extra Trees (ET). ET [15] is an ensemble learning mechanism similar to RF. ET performs classification and regression by combining the results of a large number of uncorrelated trees. The first of the two key differences between ET and RF is that ET draws its samples without replacement. The second is that it chooses random split points for the tree nodes rather than the best ones [53]. Furthermore, ET is preferable to RF in that it is faster and is less affected by noisy data.
(4) Rotation Forest (RoF). Rodriguez proposed the rotation forest in 2006 [16], based on the random forest concept. Feature transformation is the basic idea behind this algorithm, which aims to enhance the diversity and accuracy of the underlying classifiers. The following steps are used to create a rotation forest model of size T.
Step 1: The feature space, denoted as F, is split into N disjoint subsets of features, and each subset contains K (= F/N) features.
Step 2: A new training set is obtained by using a bootstrap algorithm to select 75% of the training data at random.
Step 3: Principal component analysis is applied to each feature subset to obtain the coefficients, and the coefficients belonging to each subspace are arranged in a sparse rotation matrix $R_v$ ($v \le V$).
Step 4: The columns of $R_v$ are rearranged to match the order of the initial features F, and the rotated training set is produced for training a specific classifier.
Step 5: The process mentioned above is repeated on all the different training sets, and a sequence of specific classifiers is produced. The final result is obtained by the majority voting rule.

(5) Random Rotation Ensemble Forest (RREF). The idea of RRE was proposed by Blaser and Fryzlewicz [17] in 2016. Even when the identical sequence of random numbers is used in the tree induction phase of the RF algorithm, so that both trees receive the same bootstrap samples and the same feature subset selections at every decision branch, the random feature rotation has a considerable influence on the resulting data partition. The resulting tree is not simply a rotated form of the unrotated tree; it has an entirely different orientation and data division. To execute a random rotation, rotations are sampled uniformly from all possible rotations. For n > 2, where n denotes the number of independent normal variates, rotating every angle in spherical coordinates at random does not result in a uniform distribution over all the rotations, implying that some rotations would be more common than others.
Let x be a unit vector pointing towards an arbitrary point on the n-sphere. The classification tree T splits the predictor space into disjoint regions $D_i$, with $1 \le i \le I$, where I denotes the total number of terminal nodes of T. The random and optimization parameters are denoted by $\alpha = \{G, w\}$ and $\beta = \{D_i, v_i\}_{1}^{I}$, respectively, where G is the random rotation coupled with T, w denotes the random sample pairs used for tree induction, and every randomly rotated input G(x) is mapped to a constant $v_i$, which depends on the region $D_i$ to which the input belongs. Thus, the tree with an indicator function J(.) is formulated as follows:

$$T(x; \alpha, \beta) = \sum_{i=1}^{I} v_i \, J\big(G(x) \in D_i\big)$$
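As a hedged illustration of how a rotation can be drawn uniformly rather than angle by angle, the sketch below orthonormalizes a matrix of independent standard normal variates with the Gram-Schmidt procedure and flips one axis if the result is a reflection. This is one common construction and is assumed here for illustration; it is not claimed to be the exact procedure used in [17] or in this study. The helper names are hypothetical, and the symbol `G` follows the notation above.

```python
import random


def determinant(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, det = len(a), 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            det = -det  # a row swap changes the sign
        det *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return det


def random_rotation(n, rng):
    """Return an n x n matrix with orthonormal rows and determinant +1."""
    rows = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Gram-Schmidt: remove the components along the rows accepted so far.
        for q in rows:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        rows.append([a / norm for a in v])
    if determinant(rows) < 0:
        rows[0] = [-a for a in rows[0]]  # turn a reflection into a rotation
    return rows


def rotate(matrix, x):
    """Apply the rotation to a feature vector x, i.e. compute G(x)."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in matrix]


rng = random.Random(42)
G = random_rotation(4, rng)
x = [1.0, 2.0, 3.0, 4.0]
gx = rotate(G, x)
```

A rotation preserves vector lengths, so `gx` has the same Euclidean norm as `x`; each tree of the ensemble would receive its own independently drawn `G`.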

Model Simulation.
A classification model is built using each of the 10 resampling techniques taken one at a time. The balanced data are then classified using each of the 5 tree-based ensemble classifiers separately, as shown in Figure 2. Therefore, our comparative experiments consist of a total of 50 trained models for each dataset. Every model construction requires a suitable hyperparameter setting. For oversampling, we have mostly chosen the minor classes to create a number of samples similar to that of the major classes; thus, the sampling strategy is "minority." Correspondingly, the sampling strategy for the undersampling techniques is "majority," where the redundant major class samples are removed to match the minor class samples. For hybrid sampling, we first set the base oversampler and undersampler with the previously stated parameters and then set the sampling strategy of the hybrid sampler to "all." Also, for SMOTE and ADASYN, we have used 5 nearest neighbors, whereas for NCL and ENN we chose 3 nearest neighbors. The hyperparameters set for the classifiers are shown in Table 2.
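The experimental grid described above can be written down as a small configuration sketch. The dictionary layout and key names are assumptions made for illustration and do not correspond to any particular library's parameters; entries whose names start with `unnamed_` are hypothetical placeholders for techniques that this section does not list by name.

```python
from itertools import product

# Settings quoted from the text; "k" is the number of nearest neighbors.
# None means that the text states no neighbor count for that technique.
oversamplers = {
    "SMOTE": {"strategy": "minority", "k": 5},
    "ADASYN": {"strategy": "minority", "k": 5},
    "unnamed_over_3": {"strategy": "minority", "k": None},   # placeholder
    "unnamed_over_4": {"strategy": "minority", "k": None},   # placeholder
}
undersamplers = {
    "TL": {"strategy": "majority", "k": None},
    "NCL": {"strategy": "majority", "k": 3},
    "unnamed_under_3": {"strategy": "majority", "k": None},  # placeholder
    "unnamed_under_4": {"strategy": "majority", "k": None},  # placeholder
}
# Hybrid samplers combine a base oversampler with a base undersampler.
hybrids = {
    "SMOTETomek": {"strategy": "all", "base": ("SMOTE", "TL")},
    "SMOTEENN": {"strategy": "all", "base": ("SMOTE", "ENN"), "enn_k": 3},
}

resamplers = {**oversamplers, **undersamplers, **hybrids}
classifiers = ["DT", "ET", "RF", "RoF", "RREF"]

# Every resampler is paired with every classifier, once per dataset.
runs = list(product(resamplers, classifiers))
```

Enumerating the pairs this way makes the count explicit: 10 resamplers times 5 classifiers gives the 50 models per dataset mentioned above.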

Experimental Setup.
All code is implemented in Python using the latest versions of its packages, such as Keras, TensorFlow, and scikit. The hardware specifications are an Intel® Core™ i5-10300H processor (2.5 GHz), 8 GB of DDR4 2933 MHz RAM, and a 4 GB NVIDIA GeForce GTX 1650 Ti. After splitting the original dataset into train and test sets, we (1) apply the resampling techniques, viz. 4 oversampling, 4 undersampling, and 2 hybrid sampling strategies, to our datasets individually; (2) train each of the tree-based ensemble classifier models with each resampling method; (3) obtain the classification performance using the assessment metrics; and (4) present a detailed comparative analysis based on the obtained metric statistics. We have uniformly overpopulated the selected minor class samples and removed neighborly linked major class samples to balance each dataset. The same strategy is used in combination for the hybrid sampling. The hyperparameter settings for the classifiers are as follows:
(1) For DT, ET, RF, and RREF, we have used the Gini criterion with a maximum tree depth of 100 and 1000 estimators, keeping the other hyperparameters as default.
(2) For RoF, we have used 1000 trees and a total of 20 features, keeping the rest as default.

Performance Evaluation Metrics (PEM).
In this work, we have adopted five prime metrics to assess our model's performance, namely precision, recall, F score, Cohen kappa, and overall accuracy, along with the time elapsed to execute the entire process. These are described as follows.
Let Y denote the total number of class labels in the dataset, $b_{ii}$ the true predictions of the ith class, $b_{ji}$ the false predictions of the ith class, and $b_{ij}$ the false predictions of the ith class into the jth class.
(1) Precision Score. Precision is used to assess the classification accuracy of each class in the imbalanced data. The precision score is expected to be high for a better classifier. The precision score (Prec_score%) measures the testing prediction rate of all samples and, for class i, is defined in the following equation:

$$P_i = \frac{b_{ii}}{b_{ii} + \sum_{j \neq i} b_{ji}} \times 100$$

(2) Recall Score. Recall, or true-positive rate, is the percentage of correctly classified events. Recall is especially well suited to assessing classification systems dealing with many skewed data classes. A large recall value indicates better performance of a classifier. For class i, the Rec_score% is given by the following equation:

$$R_i = \frac{b_{ii}}{b_{ii} + \sum_{j \neq i} b_{ij}} \times 100$$

(3) F_Score. In the classification of imbalanced data, the F-measure, an assessment index derived by combining precision and recall, has been widely employed. The F-measure combines the two, and the greater the F-measure, the better the classifier's performance:

$$F_i = \frac{2 P_i R_i}{P_i + R_i}$$

where $P_i$ and $R_i$ denote the precision and recall of class i, respectively.
(4) Cohen_Kappa_Score. Cohen kappa is a statistic that evaluates the consistency of the predictions and determines whether that consistency could have arisen by chance. The greater the Cohen_kappa, the better the classifier's performance. Kappa_score% is given as follows:

$$\text{Kappa\_score}\% = \frac{N \sum_{i=1}^{Y} b_{ii} - \sum_{i=1}^{Y} q_i q_i'}{N^2 - \sum_{i=1}^{Y} q_i q_i'} \times 100$$

where N is the total number of samples and $q_i$ and $q_i'$ denote the original and predicted sample sizes of class i, respectively.
(5) Overall Accuracy. Overall accuracy (OA), as a performance metric, assigns a similar weight to each data type, regardless of its number of instances. The definition of OA is given as follows:

$$\text{OA}\% = \frac{\sum_{i=1}^{Y} b_{ii}}{\sum_{i=1}^{Y} \sum_{j=1}^{Y} b_{ij}} \times 100$$
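The five metrics can be computed from a confusion matrix as in the following sketch. It assumes the standard textbook definitions, with `cm[i][j]` counting the class-i samples predicted as class j, and it reports per-class values as fractions rather than percentages; the function name and the 2 x 2 example matrix are illustrative and are not results from this study.

```python
def classification_metrics(cm):
    """Compute precision, recall, F score, Cohen kappa, and OA from cm.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    y = len(cm)
    total = sum(sum(row) for row in cm)
    true_sizes = [sum(cm[i]) for i in range(y)]                       # q_i
    pred_sizes = [sum(cm[i][j] for i in range(y)) for j in range(y)]  # q_i'

    precision = [cm[i][i] / pred_sizes[i] if pred_sizes[i] else 0.0 for i in range(y)]
    recall = [cm[i][i] / true_sizes[i] if true_sizes[i] else 0.0 for i in range(y)]
    f_score = [2 * p * r / (p + r) if p + r else 0.0
               for p, r in zip(precision, recall)]

    oa = sum(cm[i][i] for i in range(y)) / total
    # Agreement expected by chance, from the true and predicted class sizes.
    expected = sum(q * qp for q, qp in zip(true_sizes, pred_sizes)) / total ** 2
    kappa = (oa - expected) / (1 - expected)

    return {"precision": precision, "recall": recall, "f_score": f_score,
            "kappa": kappa, "oa": oa}


# Illustrative two-class confusion matrix (not data from the study).
cm = [[50, 10],
      [5, 35]]
scores = classification_metrics(cm)
```

Multiplying the values by 100 gives the percentage form used in the tables; averaging the per-class lists over the classes is one possible way to obtain a single score per metric.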

For Indian Pines Dataset
(1) Effect of Oversampling. Table 3 compares the performances of the tree-based classifier models after oversampling the data. It is evident that SMOTE achieves better results in all performance metrics for all classifiers. RREF attains the highest accuracy of 89.49%, which is slightly higher than that of RoF, by approximately 1.32. Also, the total time consumed is maximum for ADASYN for all the classifiers, especially for RoF.
(2) Effect of Undersampling. Table 4 compares the performances of the tree-based classifier models after undersampling the data. It is evident that TL achieves better results in all performance metrics for all classifiers, except for DT, where NCL attains the highest OA. RREF attains the highest accuracy of 85.32%, which is slightly higher than that of RoF, by approximately 2.76. Also, the total time consumed is maximum for TL for all the classifiers, especially for RoF.

For Salinas Valley Dataset
(1) Effect of Oversampling. Table 6 compares the performances of the tree-based classifier models after oversampling the data. ADASYN achieves better results in all performance metrics for all classifiers. RREF attains the highest accuracy of 95.84%, which is slightly higher than that of RoF, by approximately 1.33. Also, the total time consumed is maximum for ADASYN for all the classifiers, especially for RoF.
(2) Effect of Undersampling. Table 7 compares the performances of the tree-based classifier models after undersampling the data. It is evident that TL achieves better results in all performance metrics for all classifiers. RoF attains the highest accuracy of 95.41%, which is slightly higher than that of RREF, by approximately 1.5. Also, the total time consumed is maximum for TL for all the classifiers, especially for RoF.
Figure 5: OA% comparison between TL and NCL associated with the tree-based ensemble classifiers for IP, SV, and PU datasets.

For Pavia University Dataset
(1) Effect of Oversampling. Table 9 compares the performances of the tree-based classifier models after oversampling the data. It is evident that SMOTE achieves better results in all performance metrics for all classifiers. RREF attains the highest accuracy of 91.89%, which is slightly higher than that of RoF, by approximately 1.45. Also, the total time consumed is maximum for ADASYN for all the classifiers, especially for RoF.
(2) Effect of Undersampling.
(3) Effect of Hybrid Sampling. Table 11 compares the performances of the tree-based classifier models after hybrid sampling of the data. It can be inferred that SMOTETomek achieves better results in all performance metrics for all classifiers. RREF attains the highest accuracy of 85.83%, which is slightly higher than that of RoF, by approximately 2.32. Also, the total time consumed is maximum for SMOTETomek for all the classifiers, especially for RoF.

Comprehensive Discussion.
The total number of land cover pixels in each band of the IP, SV, and PU datasets is 10249, 54129, and 42776, respectively. Also, SV represents a valley scene, whereas the others represent urban sites. From the tables above, certain inferences can be drawn. As an oversampling technique, SMOTE stands out as the best for the IP and PU datasets, but for SV, ADASYN produces the best result across all classifiers. SMOTE and ADASYN achieve better outcomes for all the datasets than the other oversampling methods. Figure 4 depicts the graphical comparison of the performances of SMOTE and ADASYN in terms of OA%. Among the undersampling approaches, TL is the best technique for achieving good results on the HS datasets. The statistics for NCL are close to those of TL, and NCL has outperformed TL with the DT classifier for all the datasets. Figure 5 represents the performance comparison between TL and NCL based on OA%. SMOTETomek has surpassed SMOTEENN in all aspects of performance for all the datasets.
Figures 6-8 plot all the PEM data to give a better understanding of the effects of the resampling techniques on the HS datasets, viz. IP, SV, and PU. The blue, red, and green curves represent the oversampling, undersampling, and hybrid sampling techniques, respectively, that produce better results than the other resampling techniques. From these graphical illustrations, we can infer the more appropriate strategy for further research. Figures 6(a)-6(e) show that the PEMs, viz. precision, recall, F_score, kappa, and OA, of the oversampling techniques with the RREF classifier are the highest for the IP dataset, whereas DT is the least impactful. For hybrid sampling, RoF achieves the best overall PEM scores. The SV dataset obtains the best outcome with oversampling and RREF, although the performance measures of RoF and RREF differ only slightly. The same scenario applies to the PU dataset in terms of PEMs, as shown in Figures 7(a)-7(e).
For the IP dataset, represented by Figures 8(a)-8(e), which has the lowest dimension and a lower spectral resolution, RREF performs best for oversampling and undersampling, but RoF stands out in hybrid sampling. The other two datasets, SV and PU, have higher dimensions enriched with more discriminative spectral features. When they undergo undersampling or hybrid sampling, RoF makes the most correct decisions. These figures indicate that DT, a simple and basic tree-based classifier, produces the lowest accuracy for all datasets. RREF, however, is an ensemble of two efficient and state-of-the-art tree-based ensemble classifiers. It generates the maximum OA for all datasets with the oversampling techniques, SMOTE and ADASYN, and the undersampling technique, TL. Hybrid sampling, in contrast, is found to be more compatible with RoF.
The calculation and comparison of the elapsed time for executing the entire training system, i.e., the time complexity, also play a vital role. This comparison is graphically illustrated in Figure 9 as a 3D bar chart. The figure shows that hybrid sampling, being the amalgamation of an oversampler and an undersampler, has a more complex structure and takes the longest time to build the model. Then come the oversampling techniques, which generate synthetic samples to recreate the balanced dataset and therefore take a certain amount of time. Finally, the undersampling strategies have the lowest elapsed time because they only delete linked neighborhood data samples. This order holds irrespective of the type of tree-based ensemble classifier applied. It is also necessary to compare the elapsed times, in seconds, of the different tree-based ensemble classifiers used in our work. The order of increasing execution time for those classifier models is consistent: DT < ET < RF < RREF << RoF. According to the previous discussion, RREF and RoF provide the best outcomes for all the resampling strategies.
The time taken by RoF is at most 136.85 times (undersampling in IP) and at least 32.12 times (hybrid sampling in PU) higher than that of RREF. The other TE (in seconds) ratios for RREF and RoF, as shown in Figure 9, lie within this range. The entire study indicates that oversampling strategies are more compatible with hyperspectral images, as they are more consistent than undersampling and hybrid sampling strategies. Oversampling creates new synthetic instances that regenerate the existing dataset and bring about a balance between the samples of the major and minor classes. Because the samples are overpopulated, the dataset does not suffer from a lack of labeled data, and no feature or information is lost, which is an issue in undersampling. The fully balanced dataset is then fed to the tree-based ensemble classifier models as input. The decision trees are built in equilibrium with the balanced samples and produce elementary decisions. Those decision outcomes are passed to the forests, and the averaged classification result is ultimately obtained.
However, from Tables 3, 6, and 9, we found that SMOTE is mostly better than ADASYN when the ratio of class imbalance is high. This is because the expected value of the SMOTE-augmented minority class is the same as that of the original minority class, whereas its variance is lower. As a result, SMOTE has little effect on classifiers that use mean values and total variances to determine classification rules. ADASYN is also helpful in reducing the learning bias caused by the data distribution of the original imbalanced dataset. The disadvantage of ADASYN compared with SMOTE is that the procedure is more complex and time-consuming due to its adaptive nature. Additionally, the differences between the PEMs obtained by RREF and RoF are found to be minor for each of the oversampling, undersampling, and hybrid sampling procedures.
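The remark above about the variance of the SMOTE-augmented minority class can be explored with a small sketch. It assumes the usual SMOTE interpolation rule, in which a synthetic point is placed at a random position on the segment between a minority sample and one of its k nearest minority neighbors, and it uses one-dimensional Gaussian toy data. The function name `smote_1d` and all numbers are illustrative and are not taken from the study.

```python
import random
from statistics import mean, pvariance


def smote_1d(minority, k, per_sample, rng):
    """Interpolate each sample toward randomly chosen nearest neighbors."""
    synthetic = []
    for i, x in enumerate(minority):
        others = [v for j, v in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda v: abs(v - x))[:k]
        for _ in range(per_sample):
            nn = rng.choice(neighbors)
            u = rng.random()  # interpolation weight in [0, 1)
            synthetic.append(x + u * (nn - x))
    return synthetic


rng = random.Random(7)
minority = [rng.gauss(0.0, 1.0) for _ in range(40)]  # toy minority class
synthetic = smote_1d(minority, k=5, per_sample=10, rng=rng)

# Summary statistics of the original and the synthetic samples.
mean_original, mean_synthetic = mean(minority), mean(synthetic)
var_original, var_synthetic = pvariance(minority), pvariance(synthetic)
```

Every synthetic point lies between two existing minority samples, so the synthetic set cannot extend beyond the range of the original data; comparing `var_original` with `var_synthetic` on data of this kind is one way to examine the variance statement.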
Finally, we can say that the classifier RREF, in combination with the oversampling algorithms such as SMOTE and ADASYN, is capable of producing outstanding classification results for balanced hyperspectral datasets.

Conclusion
Data imbalance has been a delicate issue in big data scenarios with enormous and high-dimensional data. Due to imbalance, classifiers suffer from low accuracy and quality. In our work, we have presented a comparative study of the effects of oversampling, undersampling, and hybrid sampling on three highly imbalanced hyperspectral datasets. Furthermore, we rely on the finding of previous researchers that tree-based ensemble classifiers are more useful when the data samples belonging to different classes are nearly balanced. Accordingly, we have incorporated a handful of well-known ensemble tree strategies that have achieved remarkable outcomes. Combining each resampling strategy with each individual classifier, we executed 50 models for each dataset. For all HS datasets, our findings revealed that SMOTE and ADASYN in oversampling, Tomek Links in undersampling, and SMOTETomek in hybrid sampling are the techniques more compatible with RoF and RREF. Practically and experimentally, oversamplers achieve higher performance statistics than other resampling techniques, as they sustain and sometimes add further features for better classification. On the other hand, undersampling and hybrid sampling remove redundant data from the major classes, which may sometimes lead to a loss of information that abruptly degrades the classification performance. Furthermore, there is enough scope to improve all the sampling strategies so that they become compatible with voluminous real-life datasets for robust applicability.
Our present work has the following limitations: (1) the classifiers used need to be cross-validated to produce more optimized outcomes; (2) a limited number of resampling techniques are used with limited hyperparameters; and (3) the computation and time complexity are high. In the future, we plan to incorporate recently developed variants of SMOTE and ADASYN along with more efficient forest ensembles. Also, we will explore whether TL can be optimized to achieve a more accurate outcome.